Group Project

Learning Quarto

Author

Denise Simonetti, Juan David Londono Betancur, Mehdi Rghazni

Published

March 24, 2024

Abstract
Programming for Business Analytics

Loading Packages

Code
library(tidyverse)
library(janitor)
library(dplyr)
library(lubridate)
library(readxl)
library(DT)
library(knitr)
library(gt)
library(dataMaid)
library(hms)
library(ggplot2)
library(readr)
library(RSQLite)
library(dbplyr)
library(tidytext)
library(caret)
library(gtExtras)
library(esquisse)
library(PerformanceAnalytics)
library(ggrepel)
library(ggstatsplot)
library(corrplot)
library(pROC)

Executive Summary

Customer churn is a serious problem for telecom companies, as it is more expensive to acquire new customers than to keep existing ones from leaving. Telecom wants to retain existing customers by providing suitable interventions to the customers who are likely to churn, but needs evidence proving that the targeted approach is accurate, so that money is not wasted on customers who would have stayed with the company anyway. As the business analytics team, we were tasked with addressing the churn problem.

The first step in our analysis was to examine the churning problem across the firm using descriptive analytics. Our analysis showed that churners differ from non-churners on important customer characteristics (number of voice messages recorded, international plan, voice mail plan, day minutes, evening minutes, night minutes, international calls, and number of customer service calls made). We also observed that as much as 14.49% of customers churn.

Secondly, we built a logistic regression model to assess the predictive accuracy of the targeted approach, using this predictive model for classification. The analytics team evaluated the logistic regression model on key metrics computed on the holdout data set and on the quality and amount of information it provides. The logistic regression model is not good at predicting churners: while the model has an accuracy of 87% on the holdout data, its precision and recall amount to only 61% and 28%, respectively. These results point to the conclusion that this is not the best model for the churn problem. We recommend that the company work on developing a better predictive model to address churn.

For the next round of statistical analysis, the analytics team will work on fixing the imbalanced classification problem (the churner class is under-represented relative to non-churners). In addition, we will build a classification tree and a neural network in an attempt to improve the predictive accuracy for churners.

Problem Description

Business Opportunity

Retaining existing customers is crucial for every telecom company, as it is more expensive to acquire new customers than to keep existing ones from leaving. Telecom wants to retain existing customers by providing suitable interventions to the customers who are likely to churn, but needs evidence proving that this targeted approach is accurate.

Business Objectives

To address the churning problem at Telecom, the business objectives of this project were to examine the churning problem across the firm and to assess the accuracy of the targeted approach in predicting which customers are likely to churn.

Research Questions

Using a data analytics strategy, the analytics team conducted a study to answer the following questions:
1) How do customers who churned compare to customers who did not churn with respect to location, account length, type of telephone plan used, number of calls and mail messages, minutes used, the charges incurred for different services, and the number of customer service calls made?
2) To what extent do customers churn?
3) Considering Telecom customer characteristics, is logistic regression an effective model for targeting customers who are likely to churn?

Data Gathering Process

To conduct this study, the analytics team collected data from Kaggle at https://www.kaggle.com/datasets/mnassrib/telecom-churn-datasets/data.

The data contained 1 target variable (churners vs non-churners) and 19 input variables (state, area code, account length, international plan, voice mail plan, number of voice messages recorded, total day calls made, total evening calls made, total night calls made, total international calls made, total day minutes used, total evening minutes used, total night minutes used, total international minutes used, total day charge, total evening charge, total night charge, total international charge, and number of customer service calls made).

In our analyses, we excluded the input variables “state” and “area code” given the project’s tight deadline. These variables can be incorporated in later analyses.

Cleaning the data also involved replacing the Yes/No values with True/False in the variables “international plan” and “voice mail plan”.

Code
# Data import and cleaning

churn_20 <- read.csv("churn-bigml-20.csv") %>%
  clean_names()

churn_80 <- read.csv("churn-bigml-80.csv") %>%
  clean_names()

# Data join and column reordering

telecom_churn <- rbind(churn_20, churn_80) %>%
  select(-c(state,area_code)) %>%
  rename(voice_mail_messages = "number_vmail_messages") %>%
  select("account_length","international_plan","voice_mail_plan","voice_mail_messages","total_day_calls","total_eve_calls","total_night_calls","total_intl_calls","total_day_minutes","total_eve_minutes","total_night_minutes","total_intl_minutes","total_day_charge","total_eve_charge","total_night_charge","total_intl_charge","customer_service_calls","churn")

# Data preparation for columns international_plan and voice_mail_plan

telecom_churn <- telecom_churn %>%
  mutate(international_plan = ifelse(test = international_plan == "Yes", 
                                     yes = "True",
                                     no = "False")) %>%
  mutate(voice_mail_plan = ifelse(test = voice_mail_plan == "Yes", 
                                     yes = "True",
                                     no = "False"))

Data Analysis Plan

To reach the business objectives of this study, the analytics team used both descriptive analytics and predictive analytics. More specifically, we used descriptive analytics to examine the churning problem across the firm (questions 1 and 2). We then used predictive analytics to build a logistic regression model in order to assess the accuracy of the targeted approach in predicting which customers are likely to churn (question 3).

All analyses were performed using R.

Data Description

The data used in this study contained 1 target variable (churners vs non-churners), as well as 17 potential predictors of churn:

Table 1
Input Variable Definition
Account Length Number of days since the customer registered his/her account
International Plan Customer has an international plan: True/False
Voice Mail Plan Customer has a voice mail plan: True/False
Voice Mail Messages Number of voice mail messages recorded
Total Day Calls Total number of calls made during the day
Total Evening Calls Total number of calls made during the evening
Total Night Calls Total number of calls made during the night
Total International Calls Total number of international calls made
Total Day Minutes Total number of minutes used during the day
Total Evening Minutes Total number of minutes used during the evening
Total Night Minutes Total number of minutes used during the night
Total International Minutes Total number of minutes used on international calls
Total Day Charge Total dollars charged for day minutes used
Total Evening Charge Total dollars charged for evening minutes used
Total Night Charge Total dollars charged for night minutes used
Total International Charge Total dollars charged for international minutes used
Customer Service Calls Number of customer service calls made

Exploratory Data Analysis

The report created with the dataMaid package shows that there are no missing values in the data set. It also reports each variable’s class (character, integer, or numeric) and its number of unique values.

To describe a few variables, let’s look at “account length”, “churn”, and “total international calls”.

Account length: The variable type of “account_length” is integer. The data is approximately normally distributed. The values range from 1 to 243, with a median of 101. The report flagged the values from 208 to 243 as possible outliers.

Churn: The variable type of “churn” is character (we will change the variable type to logical later on). The mode is clearly False (non-churners).

Total international calls: The variable type of “total_intl_calls” is integer. The data is right-skewed, since the right tail is longer than the left tail. The values range from 0 to 20, with a median of 4.
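These per-variable summaries can also be reproduced directly in base R without rerunning the full dataMaid report. A minimal sketch on an illustrative vector (the real call would use telecom_churn$account_length), applying the common 1.5 × IQR rule to flag potential high-end outliers:

```r
# Illustrative values; in the report this would be telecom_churn$account_length
account_length <- c(1, 45, 87, 101, 101, 128, 150, 208, 243)

# Five-number summary plus mean
summary(account_length)

# 1.5 * IQR rule: values above Q3 + 1.5 * IQR are flagged as potential outliers
q3 <- quantile(account_length, 0.75)
upper_fence <- q3 + 1.5 * IQR(account_length)
account_length[account_length > upper_fence]
```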

Code
makeDataReport(telecom_churn,output = 'html', replace = TRUE)
Code
# Path relative to the project directory
knitr::include_graphics("dataMaid_telecom_churn.html")

Solution: Descriptive Analytics

Differences Between Churners and Non-Churners

As shown by the visualizations below, we see the most differences between churners and non-churners in the following variables: number of voice messages recorded, international plan, voice mail plan, day minutes, evening minutes, night minutes, international calls, and number of customer service calls made.

Code
attributes_list <- c("account_length","total_day_calls","total_eve_calls","total_night_calls","total_intl_calls","total_day_minutes","total_eve_minutes","total_night_minutes","total_intl_minutes","total_day_charge","total_eve_charge","total_night_charge","total_intl_charge","customer_service_calls")

churn_yes_no <- telecom_churn %>%
  group_by(churn) %>%
  summarise(across(.cols = all_of(attributes_list),
                   .fns = list(median = median, min = min, max = max),
                   .names = "{.fn}_{.col}"))

mean_voice_mail_messages <- telecom_churn %>%
  group_by(churn) %>%
  summarise(mean_voice_mail_messages = mean(voice_mail_messages, na.rm = TRUE))

# Merging all summary statistics

telecom_summary_s <- merge(churn_yes_no, mean_voice_mail_messages, by = intersect(names(churn_yes_no), names(mean_voice_mail_messages)))

telecom_summary_s <- as.data.frame(telecom_summary_s)

# Transposing the table 

telecom_summary_s <- telecom_summary_s %>%
  t()

colnames(telecom_summary_s) <- telecom_summary_s[1, ]

telecom_summary_s <- telecom_summary_s[-1, ]

telecom_summary_s <- as.data.frame(telecom_summary_s)


# Creating clear separations between Metrics and Statistics

telecom_summary_s <- telecom_summary_s %>%
  rownames_to_column(var = "Metrics")


telecom_summary_s <- telecom_summary_s%>%
  mutate(Statistics = str_extract(Metrics, "^[^_]*")) %>%
  mutate(Metric = str_extract(Metrics, "(?<=_).*")) %>%
  mutate(Metric = str_replace_all(Metric, "_", " ")) %>%
  select(-Metrics) %>%
  select(Metric, Statistics, True, False) %>%
  mutate(Metric = ifelse(duplicated(Metric), "", Metric)) %>%
  gt() %>%
  tab_spanner(label = "Churner", columns = c(True, False)) %>%
  tab_header(title = "Telecom Summary Statistics", subtitle = "Grouped by Metric")

telecom_summary_s
Telecom Summary Statistics
Grouped by Metric
Metric Statistics Churner
True False
account length median 103 100
min 1 1
max 225 243
total day calls median 103 100
min 0 0
max 165 163
total eve calls median 101 100
min 48 0
max 168 170
total night calls median 100 100
min 49 33
max 158 175
total intl calls median 4 4
min 1 0
max 20 19
total day minutes median 217.6 177.2
min 0 0
max 350.8 315.6
total eve minutes median 211.3 199.6
min 70.9 0.0
max 363.7 361.8
total night minutes median 204.80 200.25
min 47.4 23.2
max 354.9 395.0
total intl minutes median 10.6 10.2
min 2 0
max 20.0 18.9
total day charge median 36.99 30.12
min 0 0
max 59.64 53.65
total eve charge median 17.96 16.97
min 6.03 0.00
max 30.91 30.75
total night charge median 9.22 9.01
min 2.13 1.04
max 15.97 17.77
total intl charge median 2.86 2.75
min 0.54 0.00
max 5.4 5.1
customer service calls median 2 1
min 0 0
max 9 8
voice mail messages mean 5.115942 8.604561
Code
churners <- subset(telecom_churn, churn == "True")
non_churners <- subset(telecom_churn, churn == "False")

ggplot(churners) +
 aes(x = international_plan) +
 geom_bar(fill = "#112446") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
 labs(x = "International plan users", 
 y = "Count of users", title = "International Plan", subtitle = "Churners") +
 theme_minimal()

Code
ggplot(non_churners) +
 aes(x = international_plan) +
 geom_bar(fill = "#EF562D") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
 labs(x = "International plan users", 
 y = "Count of users", title = "International Plan", subtitle = "Non-churners") +
  theme_minimal()

Code
churners <- subset(telecom_churn, churn == "True")
non_churners <- subset(telecom_churn, churn == "False")

ggplot(churners) +
 aes(x = voice_mail_plan) +
 geom_bar(fill = "#112446") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
 labs(x = "Voice mail plan users", 
 y = "Count of users", title = "Voice Mail Plan", subtitle = "Churners") +
 theme_minimal()

Code
ggplot(non_churners) +
 aes(x = voice_mail_plan) +
 geom_bar(fill = "#EF562D") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
 labs(x = "Voice mail plan users", 
 y = "Count of users", title = "Voice Mail Plan", subtitle = "Non-churners") +
  theme_minimal()

T-test Analysis

There is also evidence of a statistically significant difference between the customers who churn and those who do not with respect to voice_mail_messages, total_intl_calls, total_day_minutes, total_eve_minutes, total_night_minutes, total_intl_minutes, total_day_charge, total_eve_charge, total_night_charge, total_intl_charge, and customer_service_calls, since their p-values are less than α (0.05).

The only variables showing no statistically significant difference between churners and non-churners are account_length, total_day_calls, total_eve_calls, and total_night_calls, since their p-values are greater than α (0.05).

Code
# T Test
telecom_churn$churn <- as.logical(telecom_churn$churn)

# List of variables to test
variables_to_test <- c("account_length", "voice_mail_messages", "total_day_calls",
                       "total_eve_calls", "total_night_calls", "total_intl_calls",
                       "total_day_minutes", "total_eve_minutes",
                       "total_night_minutes", "total_intl_minutes",
                       "total_day_charge", "total_eve_charge",
                       "total_night_charge", "total_intl_charge",
                       "customer_service_calls")

# Create an empty dataframe to store t-test results
t_test_results <- data.frame(variable = character(),
                              t_ratio = numeric(),
                              p_value = numeric(),
                              stringsAsFactors = FALSE)

# Perform t-test for each variable and store results in dataframe
for (variable in variables_to_test) {
  # Perform t-test
  t_test_result <- t.test(telecom_churn[[variable]] ~ telecom_churn$churn)
  
  # Extract t-ratio and p-value from t-test result
  t_ratio <- t_test_result$statistic
  p_value <- t_test_result$p.value
  
  # Append results to dataframe
  t_test_results <- rbind(t_test_results, data.frame(variable = variable, 
                                                     t_ratio = t_ratio,
                                                     p_value = p_value))
}

# Round the statistics for display
t_test_results$p_value <- round(t_test_results$p_value, 3)

t_test_results$t_ratio <- round(t_test_results$t_ratio, 3)

t_test_results %>%
  gt() %>%
  tab_header(title = "T-test")
T-test
variable t_ratio p_value
account_length -0.962 0.336
voice_mail_messages 5.821 0.000
total_day_calls -1.002 0.317
total_eve_calls -0.537 0.591
total_night_calls -0.349 0.727
total_intl_calls 2.960 0.003
total_day_minutes -9.685 0.000
total_eve_minutes -5.272 0.000
total_night_minutes -2.171 0.030
total_intl_minutes -3.939 0.000
total_day_charge -9.684 0.000
total_eve_charge -5.272 0.000
total_night_charge -2.171 0.030
total_intl_charge -3.940 0.000
customer_service_calls -8.955 0.000
Code
# account length t-test

ggbetweenstats(data = telecom_churn, x = "churn", y = "account_length") + ggtitle("T-test account length")

Code
# voice mail messages t-test

ggbetweenstats(data = telecom_churn, x = "churn", y = "voice_mail_messages") + ggtitle("T-test voice mail messages")

Code
# calls t-test

ggbetweenstats(data = telecom_churn, x = "churn", y = "total_day_calls") + ggtitle("T-test total day calls")

Code
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_eve_calls") + ggtitle("T-test total eve calls")

Code
# calls t-test

ggbetweenstats(data = telecom_churn, x = "churn", y = "total_night_calls") + ggtitle("T-test total night calls")

Code
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_intl_calls") + ggtitle("T-test total intl calls")

Code
# minutes t-test

ggbetweenstats(data = telecom_churn, x = "churn", y = "total_day_minutes") + ggtitle("T-test total day minutes")

Code
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_eve_minutes") + ggtitle("T-test total eve minutes")

Code
# minutes t-test

ggbetweenstats(data = telecom_churn, x = "churn", y = "total_night_minutes") + ggtitle("T-test total night minutes")

Code
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_intl_minutes") + ggtitle("T-test total_intl_minutes")

Code
# Charges t-test

ggbetweenstats(data = telecom_churn, x = "churn", y = "total_day_charge") + ggtitle("T-test total day charge")

Code
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_eve_charge") + ggtitle("T-test total eve charge")

Code
# Charges t-test

ggbetweenstats(data = telecom_churn, x = "churn", y = "total_night_charge") + ggtitle("T-test total night charge")

Code
ggbetweenstats(data = telecom_churn, x = "churn", y = "total_intl_charge") + ggtitle("T-test total intl charge")

Code
# Customer service calls t-test

ggbetweenstats(data = telecom_churn, x = "churn", y = "customer_service_calls") + ggtitle("T-test customer service calls")

Percentage of Churners

Based on the descriptive statistics analysis, we also observe that 14.49% (483 out of 3333) of the customers of Telecom churned. On the other hand, 85.51% (2850 out of 3333) of the customers of Telecom did not churn.

Table

Code
counts <- table(telecom_churn$churn)
percentages <- prop.table(counts) * 100

# Step 2: Create a data frame
result <- data.frame(Category = c("Non-Churners", "Churners"),
                     Count = as.numeric(counts),
                     Percentage = round(percentages, 2))

result$Percentage.Freq <- paste0(result$Percentage.Freq, "%")
result <- result %>% select(-Percentage.Var1)

# Step 3: Create gt table
result %>% gt()
Category Count Percentage.Freq
Non-Churners 2850 85.51%
Churners 483 14.49%

Chart

Code
# Percentage.Freq is a character string such as "85.51%"; strip the "%"
# so the y-axis is numeric rather than discrete
ggplot(result, aes(x = Category,
                   y = as.numeric(sub("%", "", Percentage.Freq)),
                   fill = Category)) +
  geom_bar(stat = "identity") +
  labs(title = "Churners vs Non-Churners",
       x = "Category",
       y = "Percentage") +
  theme_minimal() +
  theme(legend.position = "none")

Data Preparation for Modeling

Since the primary goal of predictive modeling is to evaluate the predictive accuracy of the model on new data (the holdout data set), the analytics team divided the data into training and holdout sets. More specifically, we put 75% of the data records (2,501 out of 3,333) in the training set and the remaining 25% (832 out of 3,333) in the holdout set.

Code
# Convert churn to logical for stratified partitioning
telecom_churn$churn <- as.logical(telecom_churn$churn)

# Fix the random seed so the 75/25 partition is reproducible
set.seed(123)
index <- createDataPartition(telecom_churn$churn, p = 0.75, list = FALSE)
training_data <- telecom_churn[index, ]
holdout_data <- telecom_churn[-index, ]

Before fitting the logistic regression model, we ran a correlation analysis on a standardized data set to check whether any highly correlated variables needed to be removed.

We discovered that the following variables are perfectly correlated (100%):
- total_day_minutes and total_day_charge
- total_eve_minutes and total_eve_charge
- total_night_minutes and total_night_charge
- total_intl_minutes and total_intl_charge

For this reason, the analytics team decided to remove total_day_charge, total_eve_charge, total_night_charge, and total_intl_charge from the logistic regression model.

Code
data_standardized <- telecom_churn

attributes_list <- c("account_length","total_day_calls","total_eve_calls","total_night_calls","total_intl_calls","total_day_minutes","total_eve_minutes","total_night_minutes","total_intl_minutes","total_day_charge","total_eve_charge","total_night_charge","total_intl_charge","customer_service_calls")

standardize_z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

data_standardized[attributes_list] <- lapply(data_standardized[attributes_list], standardize_z_score)

attributes <- data_standardized %>%
  select(-c(churn, international_plan, voice_mail_plan))

correlation_matrix <- cor(attributes)

# Display correlation matrix in a nice table format
corrplot(correlation_matrix, method = "number", type = "upper", 
         diag = TRUE, tl.cex = 0.7, tl.col = "black",
         addCoef.col = "black", addCoefasPercent = FALSE, bg = "red")

Eigen Values

In the list of eigenvalues computed below, the range of values is relatively small: the largest eigenvalue is approximately 1.0878 and the smallest approximately 0.9392.

While there is some variance among the eigenvalues, the difference in magnitude between the largest and smallest is not substantial. Based on these eigenvalues alone, there does not appear to be severe multicollinearity in the data.

Code
attributes1 <- data_standardized %>%
  select(-c(churn, international_plan, voice_mail_plan,
            total_day_charge, total_eve_charge,
            total_night_charge, total_intl_charge))

correlation_matrix1 <- cor(attributes1)

eigen_values <- eigen(correlation_matrix1)$values
print(eigen_values)
 [1] 1.0878000 1.0555177 1.0399788 1.0225715 1.0003321 0.9885937 0.9717228
 [8] 0.9691722 0.9629284 0.9622282 0.9391545
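A single-number summary of this check is the condition number of the correlation matrix, the ratio of the largest to the smallest eigenvalue; a value close to 1 (and far below the conventional warning thresholds) supports the same conclusion. A short sketch using the eigenvalues printed above:

```r
# Eigenvalues of the predictor correlation matrix, as printed above
eigen_values <- c(1.0878000, 1.0555177, 1.0399788, 1.0225715, 1.0003321,
                  0.9885937, 0.9717228, 0.9691722, 0.9629284, 0.9622282,
                  0.9391545)

# Condition number: close to 1 means the predictors are nearly orthogonal
condition_number <- max(eigen_values) / min(eigen_values)
round(condition_number, 3)
```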

Final Predictors

The analytics team also decided to exclude total_day_calls, total_eve_calls, and total_night_calls as predictors of churn, since they are theoretically related to their respective totals of minutes (they capture the same usage).

Based on the correlation analysis performed, the final list of predictors to be included in the model is the following: account_length, international_plan, total_intl_calls, voice_mail_plan, voice_mail_messages, total_day_minutes, total_eve_minutes, total_night_minutes, total_intl_minutes, customer_service_calls.

Therefore, the logistic regression will be performed utilizing these 10 predictors of churn.

Code
training_data <- training_data %>%
  select(-c(total_day_charge, total_eve_charge,
            total_night_charge, total_intl_charge,
            total_day_calls, total_eve_calls, total_night_calls))

holdout_data <- holdout_data %>%
  select(-c(total_day_charge, total_eve_charge,
            total_night_charge, total_intl_charge,
            total_day_calls, total_eve_calls, total_night_calls))

Solution: Predictive Analytics

The analytics team decided to use a logistic regression model. Logistic regression allows decision makers to predict the probability of a particular categorical response for a given set of independent variables. It models a function of the response called the logit (log-odds), which can be mapped back to a probability and, in turn, to a class (churners vs non-churners).
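The logit-to-probability-to-class mapping can be illustrated in a few lines (the 0.5 cutoff mirrors the classification threshold used later):

```r
# Inverse-logit: map a log-odds value back to a probability
logit_to_prob <- function(logit) 1 / (1 + exp(-logit))

p <- logit_to_prob(0)  # a logit of 0 corresponds to a probability of 0.5
predicted_class <- ifelse(p > 0.5, "churner", "non-churner")
```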

The output below contains the estimated coefficient for each predictor of churn, its standard error, z-value, and associated p-value.
The significance codes at the bottom give the p-value thresholds denoted by asterisks; for instance, *** next to a coefficient indicates a very high level of statistical significance.

The AIC (Akaike Information Criterion) is a measure of the relative quality of the statistical model for a given set of data. The lower the AIC, the better the model.

Overall, the model is statistically significant, with several predictors showing strong associations with the likelihood of churn.

Code
logistic_model <- glm(churn ~ ., data = training_data, family = binomial(link = "logit"))

summary(logistic_model)

Call:
glm(formula = churn ~ ., family = binomial(link = "logit"), 
    data = training_data)

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)            -8.651964   0.629467 -13.745  < 2e-16 ***
account_length          0.002195   0.001607   1.366 0.171831    
international_planTrue  1.994215   0.171274  11.643  < 2e-16 ***
voice_mail_planTrue    -2.360739   0.662644  -3.563 0.000367 ***
voice_mail_messages     0.045054   0.020680   2.179 0.029357 *  
total_intl_calls       -0.090857   0.029435  -3.087 0.002024 ** 
total_day_minutes       0.013013   0.001258  10.345  < 2e-16 ***
total_eve_minutes       0.007496   0.001317   5.692 1.26e-08 ***
total_night_minutes     0.004332   0.001289   3.361 0.000777 ***
total_intl_minutes      0.101902   0.023661   4.307 1.66e-05 ***
customer_service_calls  0.545444   0.045119  12.089  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2071.8  on 2500  degrees of freedom
Residual deviance: 1618.7  on 2490  degrees of freedom
AIC: 1640.7

Number of Fisher Scoring iterations: 6

In this logit equation, p represents the probability of a customer churning, while the predictors represent various characteristics or behaviors of the customer.

To interpret a few terms of the logit function, consider the intercept and the international plan:
- Intercept (-8.652) is the baseline log-odds of a customer churning when all predictor variables are zero.
- international_planTrue (1.9942): having an international plan increases the log-odds of a customer churning by 1.9942 units.
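A log-odds coefficient can also be read on the odds scale by exponentiating it; a quick sketch using the international plan coefficient from the model output above:

```r
# Exponentiating a log-odds coefficient gives an odds ratio
coef_intl_plan <- 1.9942
odds_ratio <- exp(coef_intl_plan)
round(odds_ratio, 2)  # ~7.35: an international plan multiplies the odds of churning by ~7
```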

Code
coefficients <- coef(logistic_model)

rounded_coefficients <- round(coefficients, 4)

cat("logit(p) = ", rounded_coefficients[1], " + ", paste(rounded_coefficients[-1], names(rounded_coefficients[-1]), sep = "*", collapse = " + "))
logit(p) =  -8.652  +  0.0022*account_length + 1.9942*international_planTrue + -2.3607*voice_mail_planTrue + 0.0451*voice_mail_messages + -0.0909*total_intl_calls + 0.013*total_day_minutes + 0.0075*total_eve_minutes + 0.0043*total_night_minutes + 0.1019*total_intl_minutes + 0.5454*customer_service_calls
Code
predicted_churn <- predict(logistic_model, newdata = holdout_data, type = "response")

Predicted_Churn <- ifelse(predicted_churn > 0.5, 1, 0)

confusion_matrix <- table(holdout_data$churn, Predicted_Churn)

confusion_matrix_gt <- as.data.frame.matrix(confusion_matrix)

confusion_matrix_gt %>%
  as_tibble(rownames = "Actual") %>%
  gt() %>%
  tab_header(title = "Confusion Matrix Holdout Data")
Confusion Matrix Holdout Data
Actual 0 1
FALSE 690 22
TRUE 86 34
Confusion Matrix Metrics
Code
TN <- confusion_matrix_gt[1, 1]
FP <- confusion_matrix_gt[1, 2]
FN <- confusion_matrix_gt[2, 1]
TP <- confusion_matrix_gt[2, 2]

total <- sum(confusion_matrix_gt)

# Metrics
accuracy <- (TP + TN) / total

misclassification_error <- (FP + FN) / total

specificity_TNR = TN / (TN + FP)

recall <- TP / (TP + FN)

precision <- TP / (TP + FP)

f1_score <- 2 * precision * recall / (precision + recall)

# Data frame with the results
results <- data.frame(
  Metric = c("Accuracy", "Misclassification Error", "Specificity (TNR)", "Recall (Sensitivity)", "Precision", "F1 Score"),
  Value = c(accuracy, misclassification_error, specificity_TNR, recall, precision, f1_score))

# gt table
results %>%
  gt() %>%
  tab_header(title = "Model Evaluation Metrics") %>%
  fmt_number(
    columns = c(Value),
    decimals = 2)
Model Evaluation Metrics
Metric Value
Accuracy 0.87
Misclassification Error 0.13
Specificity (TNR) 0.97
Recall (Sensitivity) 0.28
Precision 0.61
F1 Score 0.39

The ROC curve helps analyze the performance of classification at various threshold settings. High true positive rates (sensitivity) of a class means that the model has performed well in classifying that particular class. The AUC (area under the curve) ranges from 0.5 (random guessing) to 1 (perfect discrimination between classes). The higher the AUC, the better the model performs.
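The AUC also has a useful probabilistic reading: it is the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner. A toy illustration with made-up scores (ties are ignored in this sketch):

```r
# Made-up predicted churn probabilities for 3 churners and 3 non-churners
scores_churners     <- c(0.9, 0.7, 0.4)
scores_non_churners <- c(0.3, 0.2, 0.6)

# AUC = share of (churner, non-churner) pairs ranked in the correct order
pairs <- expand.grid(pos = scores_churners, neg = scores_non_churners)
auc_toy <- mean(pairs$pos > pairs$neg)
auc_toy  # 8 of 9 pairs correctly ordered, ~0.889
```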

In our case, the area under the curve on the holdout data set is 0.816 (about 82%). This is fair but not strong discrimination, and combined with the low recall for churners it is not sufficient for the targeted approach. The analytics team recommends developing a better model to predict churners and provide incentives to them.

Code
predicted_probabilities <- predict(logistic_model, newdata = holdout_data, type = "response")

roc_curve <- roc(holdout_data$churn, predicted_probabilities)

auc_value <- auc(roc_curve)
auc_rounded <- round(auc_value, 3)

print(paste("AUC:", auc_rounded))
[1] "AUC: 0.816"
Code
# Plot the ROC curve with the false positive rate on the x-axis
# (legacy.axes = TRUE flips pROC's default specificity axis)
plot(roc_curve, legacy.axes = TRUE, col = "blue", lwd = 2,
     xlab = "False Positive Rate", ylab = "True Positive Rate",
     main = "ROC Curve")

# Add diagonal reference line for a random classifier
abline(a = 0, b = 1, col = "black", lty = 2)

# Add legend
legend("bottomright", legend = c("ROC Curve", "Baseline (AUC = 0.5)"), 
       col = c("blue", "black"), lty = c(1, 2), lwd = c(2, 1))

Conclusion and Recommendations

The logistic regression model is not good at predicting churners. While the model has an accuracy of 87% on the holdout data, its precision and recall amount to only 61% and 28%, respectively (the model predicts non-churners much better than churners, most probably because non-churners are far more numerous). The AUC of 0.816 is likewise fair but not strong.

These results point to the conclusion that this is not the best model to approach the churning problem.

We recommend that the company work on developing a better predictive model to address churn. Specifically, we recommend fixing the imbalanced classification problem (the churner class is under-represented relative to non-churners). We also suggest testing other algorithms, such as classification trees and neural networks, in an effort to improve the predictive accuracy for churners and the effectiveness of the targeted approach in retaining probable churners.